This document can be viewed at: https://timzai.github.io/siads532_strava/

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from itertools import permutations
from itertools import combinations
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
# pio.renderers.default = 'iframe'

%matplotlib inline
In [2]:
%load_ext watermark
%watermark -v -m -p pandas,numpy,matplotlib,plotly
CPython 3.7.3
IPython 7.7.0

pandas 0.25.0
numpy 1.17.0
matplotlib 3.1.1
plotly 4.12.0

compiler   : GCC 7.3.0
system     : Linux
release    : 4.14.70-67.55.amzn1.x86_64
machine    : x86_64
processor  : x86_64
CPU cores  : 32
interpreter: 64bit

Rule's Ten Rules

Rule et al’s Ten simple rules for writing and sharing computational analyses in Jupyter Notebook

The below rules will be followed throughout this assignment:

  • Rule 1: Tell a story for an audience. I tried to find a picture of how Chris' exercise pattern changes over time and I think I managed to do that with heart rate data.
  • Rule 2: Document the process, not just the results. I definitely did document basically my entire process from start to finish, and from the data exploration to the final findings and plots. I have dropped some explorations that lead to nowhere, but anything that gave me insights remained in the notebook.
  • Rule 3: Use cell divisions to make steps clear. Most of my cells are short and serves a single purpose. I also insert Markdown cells between code cells to describe my thought process.
  • Rule 4: Modularize code. I tried to modularize my code into functions and made it so that they can be easily reused with different variables across different datasets or columns.
  • Rule 5: Record dependencies. Dependencies are recorded above using watermark.
  • Rule 7: Build a pipeline. The notebook is able to be re-run from top to bottom and also allows out-of-order cell runs if the user wants to tweak parameters. Variables are generally contained within each section and will not interfere with other sections.

Getting Started

Alright, so I haven't really had any experience with fitness data nor have I really done much running or cycling, so I really want to just start by exploring what the data is about before deciding on how to proceed.

I'll just start by loading the data and sampling from the it.

In [3]:
all_data = pd.read_csv("assets/strava.csv")
print(len(all_data))
display(all_data.sample(5))
40649
Air Power Cadence Form Power Ground Time Leg Spring Stiffness Power Vertical Oscillation altitude cadence datafile ... enhanced_speed fractional_cadence heart_rate position_lat position_long speed timestamp unknown_87 unknown_88 unknown_90
24961 4.0 81.0 102.0 306.0 13.0 319.0 6.5 NaN 82.0 activities/2797454257.fit.gz ... 2.743 0.0 161.0 504616356.0 -999398311.0 NaN 2019-08-21 03:35:36 0.0 300.0 -2.0
5774 NaN NaN NaN NaN NaN NaN NaN 3913.0 79.0 activities/2708973950.fit.gz ... 2.706 0.5 120.0 504435986.0 -999129695.0 2706.0 2019-07-21 22:00:51 0.0 300.0 NaN
39014 NaN NaN NaN NaN NaN NaN NaN NaN 68.0 activities/2917827233.fit.gz ... 6.270 0.0 NaN 505072177.0 -999944330.0 NaN 2019-09-30 22:36:15 0.0 NaN NaN
12855 NaN NaN NaN NaN NaN NaN NaN 3758.0 84.0 activities/2758776303.fit.gz ... 2.958 0.5 166.0 504290792.0 -999078441.0 2958.0 2019-08-07 23:42:19 0.0 300.0 -4.0
40387 NaN NaN NaN NaN NaN NaN NaN NaN 80.0 activities/2925939753.fit.gz ... 7.446 0.0 137.0 504814444.0 -999640266.0 NaN 2019-10-03 22:45:55 0.0 300.0 NaN

5 rows × 22 columns

First Look at the Data

Alright, so we got 22 columns and a timestamp column that we can probably infer to datetime. Notice that there are already a ton of NaNs, this is expected because Chris said that a few sensors were used and they may not be present during all exercises.

Let's do a quick summary of the data using .dtypes, .isna().sum(), and .nunique().

In [4]:
all_data = pd.read_csv("assets/strava.csv", parse_dates=['timestamp'], infer_datetime_format=True)
all_data_summary = pd.DataFrame([all_data.dtypes,all_data.isna().sum(),all_data.nunique()], index=['dtypes','isna().sum()','nunique()']).T
display(all_data_summary)
dtypes isna().sum() nunique()
Air Power float64 22807 40
Cadence float64 22802 43
Form Power float64 22807 82
Ground Time float64 22802 211
Leg Spring Stiffness float64 22807 83
Power float64 22802 306
Vertical Oscillation float64 22802 54
altitude float64 25744 688
cadence float64 22 112
datafile object 0 64
distance float64 0 39062
enhanced_altitude float64 51 733
enhanced_speed float64 10 1241
fractional_cadence float64 22 2
heart_rate float64 2294 127
position_lat float64 192 36988
position_long float64 192 38448
speed float64 25721 365
timestamp datetime64[ns] 0 40649
unknown_87 float64 22 1
unknown_88 float64 2294 2
unknown_90 float64 22031 20

Digging Deeper

So a few things that I could quickly infer from this:

  1. timestamp is unique since nunique() is same as len() of alldata
  2. confirmed that there is a lot of NaN data like the earlier observation
  3. there were 64 exercise sessions
  4. basically everything are floats, so a lot of numeric analysis can be done already

Since there are a lot of NaNs, I think it'll be a good idea to see the how much of each column's non-NaN overlap with others. I'll use a scatter plot to represent the matrix of column combinations with color representing the overlap count. This shoould allow me to easily visualize which columns have higher count of overlap with other columns.

In [5]:
overlaps = pd.DataFrame(permutations(all_data.columns,2),columns=['x','y'])
overlaps['num_rows'] = overlaps.apply(lambda r: all_data[r.x].notna().sum() & all_data[r.y].notna().sum(), axis=1)
overlaps.sample(5)
Out[5]:
x y num_rows
228 distance unknown_87 40577
210 distance Air Power 1152
40 Cadence unknown_88 1427
70 Ground Time cadence 1203
400 unknown_87 Cadence 1203
In [6]:
sc = plt.scatter(overlaps.x, overlaps.y, c=overlaps.num_rows)
plt.colorbar(sc)
plt.show()

Analyzing Overlaps

Okay! This is totally what I wanted - except that "Air Power" is weirdly situated, the axis labels are a mess and I think that I can probably remove a few columns that I definitely won't be using, or remove one of lat/long since those appear to come in pairs. I'm already seeing some block-like patterns indicating that certain columns probably appear with others more frequently (same sensors perhaps!)

In [7]:
alt_columns = list(all_data.columns)

# these exists for all entries, so we can remove them (distance is still there)
alt_columns.remove("datafile")
alt_columns.remove("timestamp")

# same as position_lat
alt_columns.remove("position_long")

# I don't know what these are also
alt_columns.remove("unknown_87")
alt_columns.remove("unknown_88")
alt_columns.remove("unknown_90")

# because enhanced is better
alt_columns.remove("speed")
alt_columns.remove("altitude")

# there is cadence
alt_columns.remove("fractional_cadence")

overlaps2 = pd.DataFrame(permutations(alt_columns,2),columns=['x','y'])
overlaps2['num_rows'] = overlaps2.apply(lambda r: all_data[r.x].notna().sum() & all_data[r.y].notna().sum(), axis=1)

sc2 = plt.scatter(overlaps2.x, overlaps2.y, c=overlaps2.num_rows)
plt.colorbar(sc2)
plt.show()

Analyzing Overlaps (cont.)

OK that looks cleaner, but I think I am thinking I should use another library so I can hover over the points to see their information. It'll be much easier to pick from the matrix interactively. I'll use plotly for this.

In [8]:
fig = px.scatter(overlaps2, x="x", y="y", color="num_rows", color_continuous_scale=px.colors.sequential.matter)
fig.update_yaxes(title=None, categoryarray=alt_columns)
fig.update_xaxes(title=None)
fig.update_traces(marker=dict(size=16))
fig.show()

Analyzing Overlaps (cont. 2)

OK that is exactly what i wanted, I can see that we won't be able to analyze certain combinations due to some of the data not overlapping much, but the orange and dark purple dots are all valid choices to do further cross analysis on.

But wait! Although we have a pretty good understanding of which columns came from the same sensors, let's just triple check by checking overlaps for 3 columns at once! I think this is a good idea to use 3D so we can move the viewpoint around to see whether there are specific column combinations that yield more overlap than others.

In [9]:
overlaps3 = pd.DataFrame(permutations(alt_columns,3),columns=['x','y','z'])
overlaps3['num_rows'] = overlaps3.apply(lambda r: all_data[r.x].notna().sum() & all_data[r.y].notna().sum() & all_data[r.z].notna().sum(), axis=1)
overlaps3 = overlaps3[overlaps3['num_rows'] > 5000]
In [10]:
fig = px.scatter_3d(overlaps3, x="x", y="y", z="z", color="num_rows", range_color=(0,overlaps3.num_rows.max()),
                    color_continuous_scale=[[0, 'rgba(255, 255, 50, 0.5)'], [0.5, 'rgba(208, 71, 85, 1)'], [1.0, 'rgba(48, 15, 62, 1)']])
fig.update_layout(scene = dict(
                    xaxis = dict(title=None, categoryarray=alt_columns),
                    yaxis = dict(title=None, categoryarray=alt_columns),
                    zaxis = dict(title=None, categoryarray=alt_columns)))
fig.update_traces(marker=dict(size=4))
fig.show()

Analyzing Overlaps in 3D

I ended up hiding the low counts to come up with a readable 3D plot. This plot clearly shows that there are pretty much 2 clusters of data with lots of overlaps. The 3D plot wasn't as cool as I'd hope it would be, but it was what I expected - the points from the sensors being added together resembling cube blocks of data.

Data Exploration with SPLOMs

Now I already did SPLOMs for assignment 3, but I felt that they were super helpful in identifying trends with an open mind. I will do a few SPLOMs with nothing in mind particularly, but would like to have some reusable code to easily generate SPLOMs.

For the initial SPLOMs, I'll look into the heart_rate, distance, and speed data separately first, according to each exercise session. I'll also add in duration and timestamp so we could explore if there are trends to be found related to Chris' exercise time and whether he made overall improvements as time passes.

In [11]:
def load_data(c):
    # Loads data with specific columns in mind
    cols=list(c.keys())
    df = pd.read_csv("assets/strava.csv", parse_dates=['timestamp'], infer_datetime_format=True)
    df = df[['timestamp','datafile']+cols].dropna(subset=cols)
    return df
In [12]:
def summarize_data(d):
    # Aggregate based on datafile (which indicates 1 exercise session) with columns and aggregate functions.
    df = load_data(d)
    d['timestamp'] = ['min', 'max']
    df = df.groupby(by="datafile").agg(d)
    df.columns = df.columns.map('_'.join)
    df['duration'] = df.loc[:,'timestamp_max'] - df.loc[:,'timestamp_min']
    df = df[df.duration > pd.Timedelta('10 min')] # you ain't working out if it's less than 10 minutes!
    df['duration_s'] = df['duration'] / np.timedelta64(1, 's')
    df['time'] = df['timestamp_min'].astype(int)/ 10**9
    return df
In [13]:
def plot_splom(df,cols,size=None):
    # plot some generic sploms!
    if size == None:
        size = (len(cols)*2,len(cols)*2)
    axes = scatter_matrix(df[cols], alpha = 0.2, figsize = size, diagonal = 'kde')
    corr = df[cols].corr().as_matrix() # let's show correlations on each subplot
    for i, j in zip(*plt.np.triu_indices_from(axes, k=1)):
        axes[i, j].annotate("%.3f" %corr[i,j], (0.8, 0.8), xycoords='axes fraction', ha='center', va='center')
    return axes

Lets start with a heart rate SPLOM.

In [14]:
hr_cols = {'heart_rate': ['min', 'max', 'mean','std']}
hr_splom_cols = ['time','duration_s','heart_rate_min','heart_rate_max','heart_rate_mean','heart_rate_std']
hr_sessions = summarize_data(hr_cols)
axes = plot_splom(hr_sessions, hr_splom_cols);
axes[3,4].patch.set_facecolor((1, 1, .66))
axes[0,3].patch.set_facecolor((1, .9, .9))
axes[0,4].patch.set_facecolor((1, .9, .9))

Nothing too interesting here except for a positive correlation between heart rate max and mean (highlighted in yellow). It's not too surprising but it's also interesting that heart rate min is less correlated. There are also some other moderated correlation between time (date and time of the exercise) and heart rate - indicating that Chris has somewhat been working his heart harder as time passed (highlighted in red).

Lets look at the distance SPLOM.

In [15]:
distance_cols = {'distance': ['max']}
distance_splom_cols = ['time','duration_s','distance_max']
distance_sessions = summarize_data(distance_cols)
axes = plot_splom(distance_sessions, distance_splom_cols);
axes[1,2].patch.set_facecolor((1, 1, .66))

OK, I wasn't expecting something like this to show up but it appears we got a V with one of our subplots (highlighted in yellow). The reason for this is because Chris has been doing running and cycling, which have different speeds which is captured by this specific plot in the SPLOM (distance / duration). Since we're talking about speed, let's see if anything shows up in the speed SPLOM.

Lets look at the speed SPLOM then!

In [16]:
speed_cols = {'speed': ['mean','max','std']}
speed_splom_cols = ['time', 'duration_s', 'speed_max', 'speed_mean', 'speed_std']
speed_sessions = summarize_data(speed_cols)
axes=plot_splom(speed_sessions, speed_splom_cols);
axes[0,2].patch.set_facecolor((1, 1, .66))

It was a bit surprising to not spot any clear trends or clusters here initially, but after looking at the speed_max plots, it seems that we have an outlier with max speed of 7744. If we were to remove that, we may have a more normal speed_max to compare - but it may also result in being very similar to speed_mean where there wasn't really anything interesting.

SPLOM Summary

So with the convenience functions, we can probably continue to generate more SPLOM, but we will stop here. I would like to move on to timeseries with heart rate data since I am interested in seeing if Chris had any changes in his exercise routines.

Heart Rate Data Analysis

Heart rate data is something that I am kind of interested in looking into - when I did exercise before my coach had a strict heart rate range I must stay in. Let's see if Chris' heart rate data has any patterns!

I would like to analyze something related to the heart rate zone.

(image from here)

While Chris isn't sure what his maximum heart rate is, he suggested that perhaps 180 is a reasonable value.

Let's begin by plotting heart rate against time. We'll get all rows in the strava file with a non-NaN value for heart_rate. I also want to resample the data to per second data and interpolate the missing data.

In [17]:
def load_heart_rate_data(datafile=[]):
    df = pd.read_csv("assets/strava.csv", parse_dates=['timestamp'], infer_datetime_format=True)
    df = df[['timestamp','datafile','heart_rate']].dropna(subset=['heart_rate'])
    df.set_index('timestamp', inplace=True)
    df = df[df['datafile'].isin(datafile)]
    df = df.resample('s').interpolate()
    df.drop(columns=['datafile'], inplace=True)
    return df
In [18]:
# picking a random datafile to see how things look first
# I picked `activities/2770194670.fit.gz` because it has a good variance of heart rate data for the zones
picked_datafile = "activities/2770194670.fit.gz"
hr_df = load_heart_rate_data([picked_datafile])

fig = go.Figure()
fig.add_trace(go.Scatter(x=hr_df.index, y=hr_df.heart_rate))
fig.show()

Adding in the Heart Rate Zones

Next, let's try to fill in the zones with the right colors. I'm going to define max_hr and zones as variables with Chris' max heart rate as well as the zone ranges from above.

In [19]:
MAX_HR = 180
DEFAULT_ZONES = {0:'lightgrey', 50:'silver', 60:'mediumturquoise', 70:'yellowgreen', 80:'gold', 90:'crimson'}

def zone_bins(zones = DEFAULT_ZONES, max_hr = MAX_HR):
    bin_values = [max_hr*z/100 for z in zones.keys()] + [np.inf]
    bin_names = ["{}%-{}%".format(a,b) if b != np.inf else "{}%+".format(a) for (a,b) in list(zip(list(zones.keys()), list(zones.keys())[1:]+[np.inf]))]
    return (bin_values,bin_names)
In [20]:
def load_heart_rate_data_with_zone(datafile=[],
                                   max_hr=MAX_HR,
                                   zones = DEFAULT_ZONES):
    bin_values, bin_names = zone_bins(zones, max_hr)
    df = load_heart_rate_data(datafile)
    df['hr_binned'] = pd.cut(df.heart_rate, bin_values, labels=bin_names)
    df['hr_zone_color'] = pd.cut(df.heart_rate, bin_values, labels=zones.values())
    return df
In [21]:
hrz_df = load_heart_rate_data_with_zone([picked_datafile])

fig = go.Figure()
for color, hrzc_df in hrz_df.groupby('hr_zone_color'):
    if len(hrzc_df) == 0:
        continue
    fig.add_trace(go.Bar(x=hrzc_df.index,
                       y=hrzc_df.heart_rate,
                       name=hrzc_df.iloc[0].hr_binned,
                       marker={'color': hrzc_df.hr_zone_color, 'line.width':0}))

fig.add_trace(go.Scatter(x=hrz_df.index, y=hrz_df.heart_rate,
                         line=dict(color='darkgrey', width=1), showlegend=False))

fig.update_layout(
    title = 'Heart Rate Zones',
    xaxis = dict(title='Date and Time'),
    yaxis = dict(title='Heart Rate (bpm)'),
    bargap=0
)

fig.show()

Adding Rolling Averages

Alright! That took a long longer than expected for me to figure out, but it was the end result that I kind of wanted. I'm not sure about the very narrow lines though, maybe a rolling mean will help remove some of that.

In [22]:
hrz_df = load_heart_rate_data_with_zone([picked_datafile])
hrz_df.heart_rate = hrz_df.heart_rate.rolling('20s').mean()
fig = go.Figure()
for color, hrzc_df in hrz_df.groupby('hr_zone_color'):
    if len(hrzc_df) == 0:
        continue
    fig.add_trace(go.Bar(x=hrzc_df.index,
                       y=hrzc_df.heart_rate,
                       name=hrzc_df.iloc[0].hr_binned,
                       marker={'color': hrzc_df.hr_zone_color, 'line.width':0}))

fig.add_trace(go.Scatter(x=hrz_df.index, y=hrz_df.heart_rate,
                         line=dict(color='darkgrey', width=1), showlegend=False))

fig.update_layout(
    title = 'Heart Rate Zones with 20s Rolling Mean',
    xaxis = dict(title='Date and Time'),
    yaxis = dict(title='Heart Rate (bpm)'),
    bargap=0
)

fig.show()

OK, I guess no dice on the narrow bars. I'm going to keep the rolling 20s means thought because I think it has a good balance for smoothness and keeps enough details.

Heart Rate Zones Alternate Analysis

There's another chart that I want to show though - and that's seeing how much of each heart rate zone Chris stays in during his exercises. I think a bar chart could do that justice.

In [23]:
def plot_hr_zone_distribution(df):
    fig = go.Figure(go.Bar(
                x=df, y=df.index, orientation='h',
                marker_color=list(DEFAULT_ZONES.values())
    ))

    fig.update_layout(
        title = 'Heart Rate Zones Distribution',
        xaxis = dict(title='Time in Zone (minutes)'),
        yaxis = dict(title='Heart Rate Zone'),
    )
    fig.show()
In [24]:
plot_hr_zone_distribution(hrz_df.hr_binned.value_counts(sort=False)/60)

Heart Rate Zones Analysis Across Sessions

Finally, I want to be able to compare the distribution of heart rate zones across sessions. I think a heatmap can show any sort of patterns.

In [25]:
df = pd.read_csv("assets/strava.csv", parse_dates=['timestamp'], infer_datetime_format=True)
df = df[['timestamp','datafile','heart_rate']].dropna(subset=['heart_rate'])
df.set_index('timestamp', inplace=True)
bin_values, bin_names = zone_bins(DEFAULT_ZONES, MAX_HR)
all_df = []

for datafile, adf in df.groupby('datafile'):
    time = adf.index[0].strftime('%B %d, %Y, %r')
    adf = adf.resample('s').interpolate()
    adf['hr_binned'] = pd.cut(adf.heart_rate, bin_values, labels=bin_names)
    adf = adf.groupby(['hr_binned']).count().reset_index()
    adf['percent'] = 100 * adf['heart_rate']  / adf['heart_rate'].sum()
    adf['timestamp'] = time
    adf['datafile'] = datafile
    all_df += [adf]

resampled_df = pd.concat(all_df)
In [26]:
fig = go.Figure(data=go.Heatmap(z=resampled_df.percent, x=resampled_df.timestamp, y=resampled_df.hr_binned, hoverongaps = False))
fig.update_layout(autosize=False,
                  title = 'Heart Rate Zones Distribution Across Sessions',
                  xaxis = dict(title='% Time in Zone', showticklabels=False),
                  yaxis = dict(title='Heart Rate Zone')
                 )
fig.show()

Exercise Pattern Changes

Here we can see that Chris' exercise pattern has changed over time. He is spending more time in the 80-90% ranges in his later sessions compared to earlier ones, where more time is spent in the 60-70% ranges.

Something Alarming...

There's also something really alarming about his exercise on August 17th, where he spent 62% of his time (19 minutes) in the 90%+ zone! I don't think you're supposed to stay in the 90%+ zone that much, as it can be dangerous! I hope that either the max heart rate is not 180 or something is wrong with the sensor that day...

In [27]:
display(resampled_df[resampled_df['timestamp'] == "August 17, 2019, 11:50:36 AM"])
hr_binned datafile heart_rate percent timestamp
0 0%-50% activities/2786247269.fit.gz 0 0.000000 August 17, 2019, 11:50:36 AM
1 50%-60% activities/2786247269.fit.gz 3 0.164024 August 17, 2019, 11:50:36 AM
2 60%-70% activities/2786247269.fit.gz 4 0.218699 August 17, 2019, 11:50:36 AM
3 70%-80% activities/2786247269.fit.gz 52 2.843084 August 17, 2019, 11:50:36 AM
4 80%-90% activities/2786247269.fit.gz 622 34.007654 August 17, 2019, 11:50:36 AM
5 90%+ activities/2786247269.fit.gz 1148 62.766539 August 17, 2019, 11:50:36 AM

Let's pull up the earlier bar chart with the exercise data from this session!

In [28]:
hrz_df2 = load_heart_rate_data_with_zone(["activities/2786247269.fit.gz"])
plot_hr_zone_distribution(hrz_df2.hr_binned.value_counts(sort=False)/60)